ABSTRACT

The number of categories for action recognition is growing
rapidly. It is thus becoming increasingly hard to collect sufficient
training data to learn conventional models for each
category. This issue may be ameliorated by the increasingly
popular “zero-shot learning” (ZSL) paradigm. In this framework
a mapping is constructed between visual features and
a human interpretable semantic description of each category,
allowing categories to be recognised in the absence of any
training data. Existing ZSL studies focus primarily on image
data and attribute-based semantic representations. In this paper,
we address zero-shot recognition in contemporary video
action recognition tasks, using a semantic word vector space
as the common space in which to embed videos and category labels.
This is more challenging because the mapping between the
semantic space and space-time features of videos containing
complex actions is more complex and harder to learn. We
demonstrate that a simple self-training and data augmentation
strategy can significantly improve the efficacy of this
mapping. Experiments on human action datasets including
HMDB51 and UCF101 demonstrate that our approach
achieves state-of-the-art zero-shot action recognition performance.

Index Terms — action recognition, zero-shot learning 

1. INTRODUCTION 

The number and complexity of action categories of interest
to be recognised in videos is growing rapidly. A consequence
of the growing complexity of actions to be recognised is that
more training data per category is required to learn sufficiently
strong models for complex actions. Meanwhile, the
growing number of categories means that it will become increasingly
difficult to collect sufficient annotated training data
for each. Moreover, the annotation of space-time segments
of video to train action recognition is more difficult and
costly than annotating static images. The “zero-shot learning”
(ZSL) paradigm has the potential to ameliorate these
issues by sharing information across categories and, crucially,
by allowing recognisers for novel categories to
be constructed based on a human description of the action,
rather than an extensive collection of training data.

The ZSL paradigm has most commonly been realised by
using attributes [?] to bridge the semantic gap between low-level
features (e.g., MBH or HOG) and human class descriptions.
Visual-to-attribute classifiers are learned on an auxiliary
dataset, and then novel categories are specified by a human in
terms of their attributes, thus enabling recognition in the absence
of training data for the new categories. With a few exceptions
[?], this paradigm has primarily been applied to images
rather than video action recognition.

An emerging alternative to the attribute-centric
strategy for bridging the semantic gap in ZSL is that of semantic
embedding spaces (SES) [?]. In this case a distributed
representation of text words is generated by a model
such as an unsupervised neural network [7] trained on a large
text corpus. This neural network is used to map the text string
of a category name into a vector space. Regressors (in contrast to
the classifiers of the attribute case) are then used to map videos
into this word vector space. Zero-shot recognition is then enabled
by mapping novel category instances (via regression)
and novel class names (via the neural network) to this common
space and performing similarity matching. The key advantage
of SES over attribute-centric approaches is that new
categories can be defined trivially by naming them, without
the requirement to exhaustively define each class in terms of
a list of attributes, which scales poorly as the breadth
of classes to recognise grows. Moreover, it allows information
sharing across categories (via the common regressor), and can
even be used to improve conventional supervised recognition
if training samples are sparse [?].

Although SES-based ZSL is a very attractive paradigm
for the reasons above, it has not previously been demonstrated
in zero-shot video action recognition. This is for two
reasons: (i) for many classes of complex actions, the mapping
from low-level features to the semantic embedding space is very
complex and hard to learn reliably, and (ii) a heavy burden
is placed on the generalisation capability of these regressors,
which need to learn a single visual-to-semantic embedding
space mapping that is general enough to cover all action categories,
including unseen action categories. This can be seen
as the pervasive issue of domain shift between the categories
on which the semantic embedding is trained and the
disjoint set of categories on which it is applied for zero-shot
recognition.

In this paper we show how to use simple data augmentation
and self-training strategies to ameliorate these issues and
achieve state-of-the-art ZSL performance on contemporary
video action datasets, HMDB51 and UCF101. Our framework
also achieves action recognition accuracy comparable to
the state of the art in the conventional supervised setting. The
processing pipeline of our framework is illustrated in Fig. 1.

2. METHODOLOGY 

2.1. Semantic Embedding Space 

To formalise the problem, we have a task $T = \{X, Y\}$ where
$X = \{x_i\}_{i=1 \ldots n}$ is the set of $d_x$ dimensional low-level
space-time visual feature representations (e.g., MBH and HOG)
of $n$ videos, including $n_{tr}$ training and $n_{ts}$ testing samples.
$Y = \{y_i\}_{i=1 \ldots n}$ are the class names/labels of each instance
(e.g. “brush hair” and “handwalk”). We want to establish
a semantic embedding space $Z$ to connect the low-level
visual features and class labels. In particular we use the
word2vector neural network [7] trained on a 100 billion word
corpus to realise a mapping $g : y \to z$ that produces a unique
$d_z$ dimensional encoding of each word in the English dictionary
and thus any class name of interest in $Y$. For multi-word
class names, such as “brush hair” or “ride horse”, we generate
a single vector $z$ by averaging the vectors of the unique words
$\{y_j\}_{j=1 \ldots N}$ in the description [?]:
$z = \frac{1}{N} \sum_{j=1}^{N} g(y_j)$.
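
The averaging rule for multi-word class names can be illustrated with a minimal sketch. The function name `embed_class_name` and the 3-dimensional toy vectors are assumptions made for illustration only; a real word2vec lookup table would supply 300-dimensional vectors.

```python
def embed_class_name(name, word_vectors):
    """Average the vectors of the unique words in a class name."""
    # dict.fromkeys keeps the first occurrence of each word, in order.
    words = list(dict.fromkeys(name.lower().split()))
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        vec = word_vectors[word]  # raises KeyError for out-of-vocabulary words
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(words) for t in total]


# Hypothetical toy vectors, not real word2vec output.
toy_vectors = {
    "brush": [1.0, 0.0, 2.0],
    "hair": [0.0, 1.0, 0.0],
}
z = embed_class_name("brush hair", toy_vectors)
print(z)
```

How out-of-vocabulary words should be handled is not specified in the text; the sketch simply raises an error.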

2.2. Visual to Semantic Mapping 

In order to map videos into the semantic embedding space
constructed above, we train a regression model $f : x \to z$
from the $d_x$ dimensional low-level space-time visual feature
space to the $d_z$ dimensional embedding space. The regression
is trained using training instances $X_{tr} = \{x_i\}_{i=1 \ldots n_{tr}}$, with
the corresponding embedding $Z_{tr} = g(Y_{tr})$ of each instance's
class name $y$ as the target value. Various methods have previously
been used for this task, including linear support vector
regression (SVR) [?] and more complex multi-layer neural
networks [?]. Considering the trade-off between accuracy and
complexity, we choose non-linear SVR with an RBF-$\chi^2$ kernel
defined by:

$$K(x_i, x_j) = \exp(-\gamma \cdot D(x_i, x_j)) \qquad (1)$$

where $D(x_i, x_j)$ is the $\chi^2$ distance between histogram-based
representations $x_i$ and $x_j$ [?]. This kernel is effective for the
histogram-based low-level space-time feature representations
[?] that we use.
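
As a minimal sketch of the kernel in Eq. 1, the snippet below evaluates it on two toy histograms. The text does not spell out the exact form of the $\chi^2$ distance, so the common definition $\frac{1}{2}\sum_k (a_k - b_k)^2 / (a_k + b_k)$ is assumed here; the small `eps` term and the function names are likewise illustrative choices.

```python
import math


def chi2_distance(a, b, eps=1e-12):
    """Assumed chi-square distance: 0.5 * sum((a_k - b_k)^2 / (a_k + b_k))."""
    # eps guards against division by zero when both bins are empty.
    return 0.5 * sum((p - q) ** 2 / (p + q + eps) for p, q in zip(a, b))


def rbf_chi2_kernel(a, b, gamma):
    """RBF-chi2 kernel: exp(-gamma * D(a, b))."""
    return math.exp(-gamma * chi2_distance(a, b))


# Toy L1-normalised histograms (hypothetical data).
h1 = [0.5, 0.5, 0.0]
h2 = [0.0, 0.5, 0.5]
k_same = rbf_chi2_kernel(h1, h1, gamma=1.0)
k_diff = rbf_chi2_kernel(h1, h2, gamma=1.0)
print(k_same, k_diff)
```

A kernel matrix built this way could be passed to any SVR implementation that accepts precomputed kernels.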

2.3. Multi-shot and Zero-shot Learning 

Distances in semantic embedding spaces have been shown to
be best measured using the cosine metric [?]. Thus we
L2-normalise each data point in this space, making Euclidean
distance comparisons effectively correspond to cosine
distance $d(z_i, z_j) = 1 - \cos(z_i, z_j)$ in this space.
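
The correspondence between the two distances can be made explicit with a short derivation, assuming $\|z_i\| = \|z_j\| = 1$ after L2 normalisation:

$$\|z_i - z_j\|^2 = \|z_i\|^2 + \|z_j\|^2 - 2 z_i^\top z_j = 2 - 2\cos(z_i, z_j) = 2\, d(z_i, z_j)$$

Squared Euclidean distance is therefore a monotonic function of cosine distance, so nearest-neighbour rankings agree under either measure.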

Multi-shot learning: For conventional multi-shot learning,
we map all data instances $X$ into the semantic space using the
projection $Z = f(X)$, and then simply train SVM classifiers
with an RBF kernel using the L2-normalised projections $f(X)$
as data.

Zero-shot learning: For zero-shot learning, test instances
and classes $X^*$ and $Y^*$ are disjoint from the training classes, i.e.,
no instances of the test classes $Y^*$ occur in the training data $Y$. For
each unique test category $y^* \in Y^*$, we obtain its semantic
space projection $g(y^*)$. Then the embedding $f(x^*)$ of each
test instance $x^*$ is generated via the support vector regressor $f$
as described earlier. To classify instances $x^*$ of novel categories,
we use nearest neighbour matching:

$$\hat{y}^* = \arg\min_{y^* \in Y^*} \| f(x^*) - g(y^*) \| \qquad (2)$$

The projections $g(y^*)$ can be seen as class prototypes in
the semantic space. Data instances $f(x^*)$ can be directly
matched against prototypes in this common space.
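
The nearest neighbour rule of Eq. 2 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names and the 2-dimensional toy prototypes are assumptions, and the projection $f(x^*)$ is taken as already computed.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def classify_zero_shot(projection, prototypes):
    """Return the unseen class whose prototype is nearest to the projection."""
    z = l2_normalise(projection)
    best_label, best_dist = None, float("inf")
    for label, proto in prototypes.items():
        dist = math.dist(z, l2_normalise(proto))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical prototypes g(y*) for two unseen classes.
prototypes = {"ride horse": [1.0, 0.0], "brush hair": [0.0, 1.0]}
predicted = classify_zero_shot([0.9, 0.2], prototypes)
print(predicted)
```
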
Self-training for domain adaptation: The change in statistics
induced by the disjointness of the training and zero-shot
testing categories $Y$ and $Y^*$ means that the regressor $f$ trained
on $X$ will not be well adapted at zero-shot test time for $X^*$,
and will thus perform poorly [?]. To ameliorate this domain shift,
we apply transductive self-training (Eq. 3) to adjust the unseen
class prototypes to be more comparable to the projected data
points. For each category prototype $g(y^*)$ we search for the
$K$ nearest neighbours among the unlabelled test instance projections,
and re-define the adapted prototype $\tilde{g}(y^*)$ as the average
of the $K$ neighbours. Thus if $NN_K(g(y^*))$ denotes the
set of $K$ nearest neighbours of $g(y^*)$, we have:

$$\tilde{g}(y^*) := \frac{1}{K} \sum_{f(x^*) \in NN_K(g(y^*))} f(x^*) \qquad (3)$$

The adapted prototypes $\tilde{g}(y^*)$ are now more directly comparable
with the test data for matching using Eq. 2.
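
The prototype adaptation step can be sketched in a few lines. The function name `adapt_prototype` and the toy data are assumptions for illustration, and L2 normalisation of the inputs is omitted for brevity.

```python
import math


def adapt_prototype(prototype, test_projections, k):
    """Replace a prototype by the mean of its k nearest test projections."""
    neighbours = sorted(
        test_projections, key=lambda z: math.dist(prototype, z)
    )[:k]
    dim = len(prototype)
    return [sum(z[d] for z in neighbours) / len(neighbours) for d in range(dim)]


# Hypothetical prototype g(y*) and unlabelled test projections f(x*).
prototype = [1.0, 0.0]
test_projections = [[0.8, 0.2], [0.6, 0.4], [0.0, 1.0], [0.1, 0.9]]
adapted = adapt_prototype(prototype, test_projections, k=2)
print(adapted)  # approximately [0.7, 0.3]
```

The adapted prototype moves from the word-vector location towards the region where the test projections actually fall, which is the effect the self-training step relies on.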

2.4. Data Augmentation 

The approach outlined so far relies heavily on the efficacy of
the low-level feature to semantic space mapping $f(x)$. As discussed,
the mapping is hard to learn well because: (i) actions
are visually complex and ambiguous, and (ii) even a mapping
well learned for training categories may not generalise well to
testing categories as required by ZSL, because the volume of
training data is small compared to the complexity of a general
visual to semantic space mapping. The self-training mechanism
above addresses the latter to some extent.

Fig. 1. Illustration of our framework’s processing pipeline. We start by exploiting word2vector $g(\cdot)$ to project textual labels $Y$
into the semantic embedding space $Z$. Then we learn a Support Vector Regression $f(\cdot)$ to map low-level visual features $X$ into
the semantic space $Z$. By augmenting the target training data with auxiliary data $X^{aux}$ we can achieve state-of-the-art
performance for Zero-shot Learning and competitive performance on Multi-shot Learning.

Another way to further mitigate both of these problems is
via augmentation with any available auxiliary dataset
$T^{aux} = \{X^{aux}, Y^{aux}\}$, which need not contain classes in common
with the target dataset $T$. This will provide more data to
learn a better and more generalisable regressor $z = f(x)$. The
auxiliary dataset class names $Y^{aux}$ are also projected into the
embedding space with the neural network, $g(Y^{aux})$. The auxiliary
instances $X^{aux}$ are aggregated with the target training
data $X_{tr}$, and both are used together to train the regressor $f$.
Although the auxiliary data is
disjoint from both the target training and zero-shot classes, it
helps both to better learn the complex visual-semantic space
mapping and to learn a more generalisable mapping that better
applies to the held-out zero-shot classes.
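
The aggregation of target and auxiliary data described in Sec. 2.4 amounts to pooling both sets and using $g(y)$ as the regression target for every instance. A minimal sketch follows, in which the function name, the toy features, and the toy embedding table are all hypothetical:

```python
def build_regression_set(target_pairs, auxiliary_pairs, embed):
    """Pool (feature, label) pairs; each label becomes its embedding target."""
    pooled = list(target_pairs) + list(auxiliary_pairs)
    features = [x for x, _ in pooled]
    targets = [embed(label) for _, label in pooled]
    return features, targets


# Hypothetical embedding table standing in for g(.).
toy_embedding = {"run": [1.0, 0.0], "jump": [0.0, 1.0], "dive": [0.5, 0.5]}
target_pairs = [([0.2, 0.8], "run"), ([0.7, 0.3], "jump")]
auxiliary_pairs = [([0.4, 0.6], "dive")]  # class not present in the target set

features, targets = build_regression_set(
    target_pairs, auxiliary_pairs, toy_embedding.__getitem__
)
print(len(features), targets[-1])
```

Because the labels of both datasets live in the same word vector space, no label alignment between the datasets is needed before pooling.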

3. EXPERIMENTS 

Datasets: Experiments are performed on HMDB51 [?]
and UCF101 [?], two of the largest and most challenging
action recognition datasets available. HMDB51 has 6766
videos with 51 categories of actions. UCF101 has 13320
videos with 101 categories of actions.

Visual Feature Encoding: For each video we extract
dense trajectory descriptors using [10] and encode Bag of
Words (BoW) features. We first compute dense trajectory descriptors
(DenseTrajectory, HOG, HOF and MBH), then randomly
sample 10,000 descriptors from all videos and learn the BoW
codebook with K-means using K=4000. Thus $d_x = 4000$.
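
The encoding step can be sketched as nearest-codeword assignment followed by a histogram. The sketch assumes the K-means codebook has already been learned, and the L1 normalisation of the histogram is an assumption not stated in the text; the 3-word codebook and descriptors are toy data.

```python
import math


def bow_histogram(descriptors, codebook):
    """Count nearest-codeword assignments and L1-normalise the counts."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)), key=lambda k: math.dist(d, codebook[k])
        )
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical codebook (K=3) and local descriptors from one video.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descriptors = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 5.0]]
histogram = bow_histogram(descriptors, codebook)
print(histogram)
```
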
Semantic Embedding Space: We adopted the skip-gram
neural network model [7] trained on the Google News dataset
(about 100 billion words). This neural network can then encode
any of approximately 3 million unique words as a
$d_z = 300$ dimensional vector.

Visual to Semantic Mapping: The SVR from visual feature
$X$ to semantic space $Z$ is learned from training data
with the RBF-$\chi^2$ kernel. The $\gamma$ parameter for the kernel is set
as a function of the distances $D(x_i, x_j)$, where $D$ is the distance
function of Eq. 1. The slack parameter $C$ for SVR was set to 2.


3.1. Zero-shot Learning

Data Split: Because there is no existing zero-shot learning
evaluation protocol for the HMDB51 and UCF101 action
datasets, we propose our own split (the data split will be
released on our website). For each dataset, we use
a 50/50 category split. Semantic space mappings are trained
on the 50% training categories, and the other 50% are held
out unseen until test time. We randomly generate 30 independent
splits and report the mean accuracy and standard
deviation for fair evaluation.

Alternative Approaches: We compare our model with the following
alternatives: (1) Random Guess - A lower bound that randomly
guesses the class label of unseen test samples. (2) Attribute
Based - the classic Direct Attribute Prediction (DAP) [?] zero-shot
recognition strategy. (3) Attribute Based - Indirect Attribute
Prediction (IAP) [?]. Note that because attribute annotations
are only available for the UCF dataset, the two attribute
methods are only tested on UCF. (4) Vanilla semantic word
vector embedding with Nearest Neighbour (NN) - This is the
simplest variant of ZSL using the same embedding space as
our model. The projection $f$ is learned as for our model, and then
NN matching is applied to classify test instances using the
unseen class prototypes. (5) Our zero-shot learning approach
(Sec. 2.3), combining Nearest Neighbour matching with Self-Training
(NN+ST). (Investigation of Data Augmentation follows in
Sec. 3.2.)

The results are presented in Tab. 1. All methods are much
better than random chance, demonstrating successful ZSL. A
direct application of the embedding space (NN) performs reasonably,
suggesting that the semantic space is effective as a representation:
videos are successfully mapped near to the correct prototypes
in the semantic space. Although NN is not clearly better
than the attribute-based approaches [?], it does not require
the latter’s extensive and costly attribute annotation. Finally,
our self-training approach performs best, suggesting that our
strategy ameliorates some of the domain shift between training
and testing categories compared to vanilla NN.
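
The split protocol described under Data Split (30 random 50/50 category splits, reported as mean and standard deviation) can be sketched as below. The function names are illustrative; how an odd number of classes is divided, the random seed, and the use of the sample standard deviation are assumptions not stated in the text.

```python
import random
import statistics


def random_class_splits(class_names, n_splits=30, seed=0):
    """Generate independent random train/test partitions of the class list."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = list(class_names)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # Assumption: with an odd class count, the extra class goes to training.
        test_classes, train_classes = shuffled[:half], shuffled[half:]
        splits.append((train_classes, test_classes))
    return splits


def summarise(accuracies):
    """Return (mean, sample standard deviation) over the per-split accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


classes = [f"class_{i}" for i in range(51)]  # placeholder names, 51 categories
splits = random_class_splits(classes)
mean_acc, std_acc = summarise([10.0, 12.0, 14.0])  # toy accuracies
print(len(splits), mean_acc, std_acc)
```
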


3.2. Zero-shot Learning with Data Augmentation 

Semantic embedding space as an intermediate representation
enables exploiting multiple datasets to improve the projection
$f$, as explained in Sec. 2.4. We next investigate the effect of
data augmentation across HMDB51 and UCF101.

Zero-shot Learning with Data Augmentation: We follow
the same zero-shot learning protocol as Sec. 3.1 but augment
the HMDB51 regressor training with data from UCF101 and
vice versa. The performance of data augmentation alone,
without self-training (NN+Aux), and of our full model including
both self-training and data augmentation (NN+ST+Aux)
is shown in Tab. 1. Overall, both strategies (NN+Aux
and NN+ST+Aux) significantly outperform their respective
baselines (NN and NN+ST), and the full model clearly beats
the classic attribute-based approaches (DAP and IAP). This
is attributed to learning a more accurate and generalisable regressor
for mapping videos into the semantic space for classification.
Note that NN roughly corresponds to the embedding
space strategy of [?] and [?].

Table 1. Zero-shot action recognition performance (average
% accuracy ± standard deviation).

Method          HMDB51        UCF101
Random Guess    4.0           2.0
DAP [?]         -             14.3 ± 1.9
IAP [?]         -             12.8 ± 2.0
NN              13.0 ± 2.7    10.9 ± 1.5
NN+ST           15.0 ± 3.0    15.8 ± 2.3
NN+Aux          18.0 ± 3.0    12.7 ± 1.6
NN+ST+Aux       21.2 ± 3.0    18.6 ± 2.2

Qualitative Illustration: We give insight into our self-training
and data-augmentation contributions in Fig. 2. We
randomly sample 5 unseen classes from HMDB and project
all samples from these classes into the semantic space by
(a) regression trained on target seen class data alone; (b)
regression trained on target seen data augmented with auxiliary
(UCF101) data. The results are visualised in 2D with
t-SNE [?]. Data instances are shown as dots, prototypes
as diamonds, and self-training adapted prototypes as stars.
Colours indicate category.

There are two main observations: (i) Comparing Fig. 2(a)
and (b), we can see that regression trained without auxiliary
data yields a less accurate projection of unseen data, as instances
are projected further from the prototypes, reducing
NN matching accuracy. (ii) Self-training is effective, as the
adapted prototypes (stars) are closer to the centre of the corresponding
samples (dots) than the original prototypes (diamonds).
These observations explain our final model’s ZSL
accuracy improvement over conventional approaches.

3.3. Multi-shot Learning

We finally validate our representation on standard supervised
(multi-shot) action recognition. We use the standard data
splits for both HMDB51 [?] and UCF101 [?]. The action
recognition accuracy is the average over the folds.

Alternatives: We compare our approach to: (i) the state-of-the-art
results based on low-level features [10], (ii) an alternative
semantic space using attributes. To realise the latter we
use Human-Labelled Attributes (HLA) [?]. We train binary
SVM classifiers with an RBF-$\chi^2$ kernel for attribute detection and
use the concatenation of attribute scores as the semantic attribute
space representation. An SVM classifier with RBF kernel is
trained on the attribute representation to predict the final labels.

The resulting accuracies are shown in Tab. 2. We observe
that our semantic embedding is comparable to the state-of-the-art
low-level feature-based classification and better than
the conventional attribute-based intermediate representation.
This may be due to the attribute space being less discriminative
than our semantic word space, or due to the reliance on
human annotation: some annotated attributes may not be detectable,
or may be detectable but not discriminative for class.

Table 2. Standard supervised action recognition mean accuracy
in % on HMDB51 and UCF101. * indicates our implementation.

Method                    HMDB51        UCF101
Low-Level Feature [10]    47.2/46.0*    75.1
HLA [?]                   -             69.7
Ours                      44.5          73.7

Fig. 2. A qualitative illustration of ZSL with semantic space
representation: self-training and data augmentation.
(a) Regression trained on target seen data alone.
(b) Regression trained on target data augmented with auxiliary data.


4. CONCLUSION

In this paper we investigated semantic embedding space representations
for video action recognition for the first time.
This representation enables projecting visual instances and
category prototypes into the same space for zero-shot recognition;
however, it poses serious challenges of projection complexity
and generalisation across domain shift. We show that
simple self-training and data augmentation strategies can address
these challenges and achieve state-of-the-art results
for zero-shot action recognition in video.